Abstract

Word representations induced from models with discrete latent variables (e.g. HMMs) have been shown to be beneficial in many NLP applications. In this work, we exploit labeled syntactic dependency trees and formalize the induction problem as unsupervised learning of tree-structured hidden Markov models. Syntactic functions are used as additional observed variables in the model, influencing both transition and emission components. Such syntactic information can potentially lead to capturing more fine-grained and functional distinctions between words, which, in turn, may be desirable in many NLP applications. We evaluate the word representations on two tasks: named entity recognition and semantic frame identification. We observe improvements from exploiting syntactic function information in both cases, with results rivaling those of state-of-the-art representation learning methods. Additionally, we revisit the relationship between sequential and unlabeled-tree models and find that the advantage of the latter is not self-evident.

1 Introduction 

Word representations have proven to be an indispensable source of features in many NLP systems as they allow better generalization to unseen lexical cases (Koo et al., 2008; Turian et al., 2010; Titov and Klementiev, 2012; Passos et al., 2014; Belinkov et al., 2014). Roughly speaking, word representations allow us to capture semantically or otherwise similar lexical items, be it categorically (e.g. cluster ids) or in a vectorial way (e.g. word embeddings). Although the methods for obtaining word representations are diverse, they normally share the well-known distributional hypothesis (Harris, 1954), according to which the similarity is established based on occurrence in similar contexts. However, word representation methods frequently differ in how they operationalize the definition of context.

Recently, it has been shown that representations using syntactic contexts can be superior to those learned from linear sequences in downstream tasks such as named entity recognition (Grave et al., 2013), dependency parsing (Bansal et al., 2014; Sagae and Gordon, 2009) and PP-attachment disambiguation (Belinkov et al., 2014). They have also been shown to perform well on datasets for intrinsic evaluation, and to capture a different type of semantic similarity than sequence-based representations (Levy and Goldberg, 2014; Suster and van Noord, 2014; Pado and Lapata, 2007).

Unlike the recent research in word representation learning, focused heavily on word embeddings from the neural network tradition (Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014), our work falls into the framework of hidden Markov models (HMMs), drawing on the work of Grave et al. (2013) and Huang et al. (2014). An attractive property of HMMs is their ability to provide context-sensitive representations, so the same word in two different sentential contexts can be given distinct representations. In this way, we account for various senses of a word.^1 However, this ability requires inference, which is expensive compared to a simple look-up, so we explore in our experiments word representations that are originally obtained in a context-sensitive way, but are then available for look-up as static representations.

Our method includes two types of observed variables: words and syntactic functions. This allows us to address a drawback of learning word representations from unlabeled dependency trees in the context of HMMs (§ 2). The motivation for including syntactic functions comes from the intuition that they act as proxies for semantic roles. The current research practice is to either discard this type of information (so context words are determined by the syntactic structure alone (Grave et al., 2013)), or include it in a preprocessing step, i.e. by attaching syntactic labels to words, as in Levy and Goldberg (2014).

^1 The handling of polysemy and homonymy typically requires extending a model in other frameworks, cf. Huang et al. (2012), Tian et al. (2014) and Neelakantan et al. (2014).

Figure 1: Hidden Markov tree model with syntactic functions, r, as additional observed layer.

We evaluate the word representations in two structured prediction tasks, named entity recognition (NER) and semantic frame identification. As our extension builds upon sequential and unlabeled-tree HMMs, we also revisit the basic difference between the two, but are unable to entirely corroborate the alleged advantage of syntactic context for word representations in the NER task.

2 Why syntactic functions 

A word can typically occur in distinct syntactic functions. Since these account for words in different semantic roles (Bender, 2013; Levin, 1993), the incorporation of the syntactic function between the word and its parent could give us more precise representations. For example, in “Carla bought the computer”, the subject and the object represent two different semantic roles, namely the buyer and the goods, respectively. Along similar lines, Pado and Lapata (2007), Suster and van Noord (2014) and Grave et al. (2013) argue that it is inaccurate to treat all context words as equal contributors to a word’s meaning.

In HMM learning, the parameters obtained from training on unlabeled syntactic structure encode the probabilistic relationship between the hidden states of parent and child, and that between the hidden state and the word. The tree structure thus only defines the word’s context, but is oblivious to the relationship between the words. For example, Grave et al. (2013) acknowledge precisely this limitation of their unlabeled-tree representations by giving the example of the hidden state of a verb, which cannot discriminate between left (e.g. subject) and right (e.g. object) neighbors because of shared transition parameters. This adversely affects the accuracy of their super-sense tagger for English. Similarly, Suster and van Noord (2014) show that filtering dependency instances based on syntactic functions can positively affect the quality of obtained Brown word clusters when measured in a wordnet similarity task.

3 A tree model with syntactic functions 

We represent a sentence as a tuple of $K$ words, $w = (w_1, \ldots, w_K)$, where each $w_k \in \{1, \ldots, |V|\}$ is an integer representing a word in the vocabulary $V$. The goal is to infer a tuple of $K$ states $c = (c_1, \ldots, c_K)$, where each $c_k \in \{1, \ldots, N\}$ is an integer representing a semantic class of $w_k$, and $N$ is the number of states, which needs to be set prior to training. Another possibility is to let $w_k$’s representation be a probability distribution over $N$ states. In this case, $w_k$’s representation is a vector in $\mathbb{R}^N$.

As usual in Markovian models, the generation of the sentence can be decomposed into the generation of classes (transitions) and the generation of words (emissions). The process is defined on a tree, in which a node $c_k$ is generated by its single parent $c_{\pi(k)}$, where $\pi : \{1, \ldots, K\} \to \{0, \ldots, K\}$, with 0 representing the root of the tree (the only node not emitting a word). We denote a syntactic function as $r \in \{r_1, \ldots, r_S\}$, where $S$ is the total number of syntactic function types produced by the syntactic parser. We encode the syntactic function at position $k$ as $r_k$, i.e. the dependency relation between $w_k$ and its parent.

We would like the variable r to modulate the transition and emission processes. We achieve this by drawing on the Input-output HMM architecture of Bengio and Frasconi (1996), who introduce a sequential model in which an additional sequence of observations called input becomes part of the model, and the model is used as a conditional predictor. The authors describe the application of their model in speech processing, where the goal is to obtain an accurate predictor of the output phoneme layer from the input acoustic layer. Our focus is, in contrast, on representation learning (hidden layer) rather than prediction (output layer). Also, we adapt their sequential topology to trees.

The probability distribution of words and semantic classes is conditional on syntactic functions and is factorized as:

$$p(w, c \mid r) = \prod_{k=1}^{K} p(w_k \mid c_k, r_k)\, p(c_k \mid c_{\pi(k)}, r_k), \tag{3.1}$$

where $r_k$ encodes additional information about $w_k$, in our case the syntactic function of $w_k$ to its parent. This is represented graphically in fig. 1.
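
To make the factorization in (3.1) concrete, the following minimal Python sketch evaluates log p(w, c | r) for a single tree. The function name, the nested-list layout of the parameters, the 0-based indexing and the convention that the artificial root sits in a fixed state 0 are assumptions made for illustration; they are not details given in the text.

```python
import math


def joint_log_prob(words, states, parents, funcs, T, O, root_state=0):
    """Log of p(w, c | r) for one dependency tree, following eq. (3.1).

    words[k], states[k], funcs[k] describe node k (0-based).
    parents[k] is the index of the parent of node k, or -1 if the parent
    is the artificial root (assumed here to be in state `root_state`).
    T[l][j][i] stands for p(c_k = i | c_parent = j, r_k = l).
    O[l][j][i] stands for p(w_k = i | c_k = j, r_k = l).
    """
    total = 0.0
    for k in range(len(words)):
        parent_state = root_state if parents[k] < 0 else states[parents[k]]
        l = funcs[k]
        total += math.log(T[l][parent_state][states[k]])  # transition term
        total += math.log(O[l][states[k]][words[k]])      # emission term
    return total
```

Each node contributes exactly one transition and one emission factor, both selected by the node's own syntactic function, which is the only difference from an unlabeled-tree HMM.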

The parameters of the model are stored in column-stochastic transition and emission matrices:^2

$T$, where $T_{ijl} = p(c_k = i \mid c_{\pi(k)} = j, r_k = l)$

$O$, where $O_{ijl} = p(w_k = i \mid c_k = j, r_k = l)$

The number of required parameters for representing the transitions is $O(N^2 S)$, and for representing the emissions $O(N |V| S)$.

Our model satisfies the single-parent constraint and can be applied to proper trees only. It is in principle possible to extend the base representation for the model by using approximate inference techniques that work on graphs (Murphy, 2012, p. 720), but we do not explore this possibility here.^3

As opposed to an unlabeled-tree HMM, our extension can in fact be categorized as an inhomogeneous model since the transition and emission probability distributions change as a function of input, cf. Bengio (1999). Another comparison concerns the learning of long-term dependencies: since in the Input-output architecture the transition probabilities can change as a function of input at each $k$, they can be more deterministic (have lower entropy) than the transition probabilities of an HMM. Having the transition parameters closer to zero or one reduces the ambiguity of the next state and allows the context to flow more easily. A concrete graphical example is given in fig. 2.

^2 We are abusing the terminology slightly, as these are in fact three-dimensional arrays.

^3 This would be relevant for dependency annotation schemes which include secondary edges.



Figure 2: The transition probabilities of a tree HMM with syntactic functions (synfunc) are sparser and have a lower entropy (5.34) than those of an unlabeled-tree HMM (tree, entropy of 5.6).


4 Learning and inference 


We train the model with the Expectation-Maximization (EM) algorithm (Baum, 1972) and use sum-product message passing for inference on trees (Pearl, 1988). The inference procedure (the estimation of hidden states) is the same as in an unlabeled-tree model, except that it is performed conditionally on r.

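The parameters $T$ and $O$ are estimated with maximum likelihood estimation. In the E-step, we obtain pseudo-counts from the existing parameters, as shown in fig. 3. The M-step then normalizes the transition pseudo-counts (and similarly for emissions):

$$T_{ijl} = \frac{n_{ijl}}{\sum_{i'} n_{i'jl}}, \tag{4.1}$$

where $n_{ijl}$ denotes the transition pseudo-count for state $i$ under parent state $j$ and syntactic function $l$.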
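
The normalization in the M-step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the nested-list layout `counts[l][j][i]` (syntactic function l, parent state j, child state i) and the uniform fallback for rows without any counts are assumptions.

```python
def m_step(counts):
    """Turn transition pseudo-counts into conditional probabilities.

    counts[l][j][i] is the expected number of times a child in state i
    was generated by a parent in state j via syntactic function l.
    For every (l, j) the result sums to one over the child states i.
    """
    T = []
    for per_function in counts:
        rows = []
        for row in per_function:
            z = sum(row)
            if z > 0:
                rows.append([c / z for c in row])
            else:
                # Assumption: fall back to a uniform distribution if a
                # (function, parent state) pair was never observed.
                rows.append([1.0 / len(row)] * len(row))
        T.append(rows)
    return T
```

The same routine can be applied to emission pseudo-counts by letting the last index range over words instead of child states.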


4.1 State splitting and merging 

Figure 3: Obtaining pseudo-counts, or expected sufficient statistics, in the E-step.

We explore the idea of introducing complexity gradually in order to alleviate the problem of EM finding a poor solution, which can be particularly severe when the search space is large (Petrov et al., 2006). The splitting procedure starts with a small number of states and splits the parameters of each state $s$ into $s_1$ and $s_2$ by cloning $s$ and slightly perturbing the copies. The model is retrained, and a new split round takes place. To allow splitting states to various degrees, Petrov et al. also merge back those split states which improve the likelihood the least. Although the merge step is done approximately and does not require new cycles of inference, we find that the extra running time does not justify the sporadic improvements we observe. We therefore settle on the splitting-only regime.
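
A single split round might look like the sketch below. It is deliberately simplified and rests on several assumptions: it handles one transition table `T[j][i] = p(child = i | parent = j)` only (no syntactic-function dimension, no emissions), the perturbation is uniform noise of a hypothetical size `noise`, and the function name is made up for illustration.

```python
import random


def split_states(T, noise=0.01, seed=0):
    """Double the number of states of a row-normalized transition table.

    State s becomes the two states 2*s and 2*s + 1. The probability mass
    that used to flow into s is shared by its two clones, with a small
    random perturbation so that EM can later pull the clones apart.
    """
    rng = random.Random(seed)
    new_T = []
    for j in range(len(T)):
        row = []
        for p in T[j]:
            eps = rng.uniform(-noise, noise)
            row.append(p * (0.5 + eps))  # first clone of the child state
            row.append(p * (0.5 - eps))  # second clone of the child state
        # Both clones of parent state j start from the same outgoing row.
        new_T.append(row)
        new_T.append(list(row))
    return new_T
```

Because the two perturbed shares of each entry add up to the original probability, every row of the enlarged table still sums to one before retraining.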

4.2 Decoding for HMM-based models 

Once a model is trained, we can search for the most probable states given observed data by using max-product message passing (Max-Product, a generalization of Viterbi) for efficient decoding on trees: $\hat{c} = \arg\max_{c} p(C = c \mid W = w, R = r)$.

We have also tried posterior (or minimum risk) decoding (Lember and Koloydenko, 2014; Ganchev et al., 2008), but without consistent improvements.

The search for the best states can be avoided by taking the posterior state distribution over $N$ hidden states (Nepal and Yates, 2014; Grave et al., 2014): $p(C_k = c \mid W = w, R = r) = \mathbb{E}[\mathbb{1}\{C_k = c\} \mid W = w, R = r]$.

We call this vectorial representation Post-Token.
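
The posterior state distributions can be computed with an upward and a downward pass over the tree. The sketch below is one way to write sum-product for this model, under stated assumptions: parameters are nested lists `T[l][j][i]` and `O[l][i][w]`, nodes attached to the artificial root have parent -1, the root is in a fixed state 0, and no approximation or rescaling of messages is applied. All names are hypothetical.

```python
def posteriors(words, parents, funcs, T, O, root_state=0):
    """Return post[k][i] = p(C_k = i | W = w, R = r) for every node k."""
    K, N = len(words), len(T[0])
    children = [[] for _ in range(K)]
    tops = []
    for k, p in enumerate(parents):
        (tops if p < 0 else children[p]).append(k)
    # Breadth-first order: every parent precedes its children.
    order = list(tops)
    for k in order:
        order.extend(children[k])

    # Upward pass: beta[k][i] is the evidence in the subtree of k given
    # c_k = i; msg[k][j] is what k sends to its parent in state j.
    beta, msg = [None] * K, [None] * K
    for k in reversed(order):
        b = [O[funcs[k]][i][words[k]] for i in range(N)]
        for ch in children[k]:
            b = [b[i] * msg[ch][i] for i in range(N)]
        beta[k] = b
        msg[k] = [sum(T[funcs[k]][j][i] * b[i] for i in range(N))
                  for j in range(N)]

    # Downward pass: down[k][i] combines c_k = i with all evidence
    # outside the subtree of k.
    down, post = [None] * K, [None] * K
    for k in order:
        p = parents[k]
        if p < 0:
            down[k] = list(T[funcs[k]][root_state])
        else:
            ctx = [down[p][j] * O[funcs[p]][j][words[p]] for j in range(N)]
            for sib in children[p]:
                if sib != k:
                    ctx = [ctx[j] * msg[sib][j] for j in range(N)]
            down[k] = [sum(ctx[j] * T[funcs[k]][j][i] for j in range(N))
                       for i in range(N)]
        unnorm = [down[k][i] * beta[k][i] for i in range(N)]
        z = sum(unnorm)
        post[k] = [u / z for u in unnorm]
    return post
```

Replacing the sums in the upward pass by maximizations (and keeping back-pointers) would give the max-product variant used for decoding.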

In both cases, inference is performed on a concrete sentence, thus providing a context-sensitive representation. We find in our experiments that Post-Token consistently outperforms Max-Product due to its ability to carry more information and uncertainty. This can then be exploited by the downstream task predictor.

One disadvantage of obtaining context-sensitive representations is the relatively costly inference. Inference and decoding are also sometimes not applicable, such as in information retrieval, where the entire sentence is usually not given (Huang et al., 2011). A trade-off between full context sensitivity and efficiency can be achieved by considering a static representation (Post-Type). It is obtained in a context-insensitive way (Huang et al., 2011; Grave et al., 2014) by averaging the (context-sensitive) posterior state distributions of all occurrences of a word type $w$ in a large corpus $U$:

$$\frac{1}{n_w} \sum_{i \in U : w_i = w} p(C_i \mid W, R), \tag{4.2}$$

where $i$ ranges over the token positions in $U$ at which $w$ occurs, $n_w$ is the number of such positions, and each summand is the posterior state distribution at position $i$ given its sentence.
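
The averaging step can be sketched in a few lines. The input format (an iterable of word and posterior pairs, one per token) and the function name are assumptions for illustration; the posteriors themselves would come from inference on the parsed corpus.

```python
from collections import defaultdict


def post_type(token_posteriors):
    """Average per-token posterior vectors into one vector per word type.

    token_posteriors yields (word, posterior) pairs, where posterior is
    a list of N probabilities for one occurrence of the word.
    """
    sums = {}
    counts = defaultdict(int)
    for word, post in token_posteriors:
        if word not in sums:
            sums[word] = [0.0] * len(post)
        sums[word] = [s + p for s, p in zip(sums[word], post)]
        counts[word] += 1
    return {w: [s / counts[w] for s in vec] for w, vec in sums.items()}
```

The resulting dictionary can be used as a plain look-up table at test time, so no inference is needed for unseen sentences.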

In fig. 4, we give a graphical example of the word representations learned with our model (§ 5.5), obtained either with Post-Token or with Post-Type. To visualize the representations, we apply multidimensional scaling.^4 The model clearly separates management positions from parts of the body and, interestingly, puts “head” closer to management positions, which can be explained by the business and economic nature of the Bllip corpus. The words “chief” and “executive” are located together, yet isolated from the others, possibly because of their strong tendency to precede nouns. The arrow on the plot indicates the shift in meaning when a Post-Token representation is obtained for “head” (part-of-body) within a sentence.


Figure 4: Representations obtained with our model with syntactic functions. All are static Post-Type representations, except “_head_”, which is obtained with Post-Token from the concrete sentence (“The head louse is one of the three ...”).

Despite the advantage of Post-Token in accounting for word senses, we observe that Post-Type performs better in almost all experiments. A likely explanation is that averaging increases the generalizability of the representations. For the concrete tasks in which we apply the word representations, the increased robustness is simply more important than context sensitivity. Also, Post-Type might be less sensitive to parsing errors at test time.

^4 https://github.com/scikit-learn/scikit-learn









5 Empirical study 

5.1 Parameters and setup 

We observe faster convergence with online EM, which updates the parameters more frequently. Specifically, we use mini-batch step-wise EM (Liang and Klein, 2009; Cappe and Moulines, 2009), and determine the hyper-parameters on a held-out dataset of 10,000 sentences to maximize the log-likelihood. We find that higher values for the step-wise reduction power $\alpha$ and the mini-batch size lead to better overall log-likelihood, but with a somewhat negative effect on the convergence speed. We finally settle on $\alpha = \{0.6, 0.7, 0.8, 0.9, 1\}$ and mini-batch size of $\{1, 10, 100, 1000\}$ sentences. We find that a couple of iterations over the entire dataset is sufficient to obtain good parameters, cf. Klein (2005).
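
As a rough illustration of a step-wise update, the sketch below interpolates the running sufficient statistics with those of the current mini-batch, using a step size of the form (k + 2)^(-alpha), the schedule commonly associated with step-wise EM (Liang and Klein, 2009). The exact schedule used in the experiments and all names here are assumptions, since the text does not spell them out.

```python
def stepwise_update(stats, batch_stats, k, alpha=0.7):
    """One step-wise EM update of a flat list of sufficient statistics.

    stats       -- running expected counts accumulated so far
    batch_stats -- expected counts computed on the current mini-batch
    k           -- number of updates performed so far (0 for the first)
    alpha       -- reduction power; larger values decay the step faster
    """
    eta = (k + 2) ** (-alpha)  # step size shrinks as k grows
    return [(1 - eta) * s + eta * b for s, b in zip(stats, batch_stats)]
```

With a larger reduction power the step size decays faster, so later mini-batches change the statistics less, which is in line with the slower convergence noted above.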
Initialization. Since the EM algorithm in our setting only finds a local optimum of the log-likelihood, the initialization of model parameters can have a major impact on the final outcome. We initialize the emission matrices with Brown clusters by first assigning random values between 0 and 1 to the matrix elements, and then multiplying those elements that represent words in a cluster by a factor of $f \in \{10, 100, 1\mathrm{K}, 10\mathrm{K}\}$. Finally, we normalize the matrices. This technique incorporates a strong bias towards word-class emissions that exist (deterministically) in Brown clusters. The transition parameters are simply set to random numbers sampled from the uniform distribution between 0 and 1, and finally normalized.
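
The emission initialization can be sketched as follows. The sketch assumes one Brown cluster per hidden state and a single emission table (no syntactic-function dimension); the function name, the `cluster_of` mapping and the default factor are hypothetical.

```python
import random


def init_emissions(vocab_size, num_states, cluster_of, factor=100.0, seed=0):
    """Brown-cluster-biased initialization of O[state][word].

    cluster_of[w] is the cluster (state) index of word w, or None if the
    word has no cluster. Entries start as uniform random values in
    [0, 1); entries whose word belongs to the state's cluster are
    multiplied by `factor`; each row is then normalized to sum to one.
    """
    rng = random.Random(seed)
    O = []
    for state in range(num_states):
        row = [rng.random() for _ in range(vocab_size)]
        for w in range(vocab_size):
            if cluster_of[w] == state:
                row[w] *= factor
        z = sum(row)
        O.append([x / z for x in row])
    return O
```

A larger factor makes the initial emissions closer to the deterministic word-to-class assignment of the Brown clustering.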

Approximate inference. Following Grave et al. (2013) and Pal et al. (2006), we approximate the belief vectors during inference,^5 which speeds up learning and works as regularization. We use the $k$-best projection method, in which only the $k$ largest
coefficients (in our case k = |A^) are kept. 
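
A k-best projection of a belief vector can be sketched as below. Whether the projected vector is renormalized afterwards is not stated in the text, so this sketch leaves the kept coefficients unchanged; the function name is hypothetical.

```python
def k_best_projection(belief, k):
    """Keep the k largest coefficients of a belief vector, zero the rest."""
    if k >= len(belief):
        return list(belief)
    ranked = sorted(range(len(belief)), key=lambda i: belief[i], reverse=True)
    keep = set(ranked[:k])
    return [b if i in keep else 0.0 for i, b in enumerate(belief)]
```

Zeroed coefficients drop out of the subsequent message products, which is where the speed-up comes from.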

5.2 Data for obtaining word representations 

English. We use the 43M-word Bllip corpus (Charniak et al., 2000) of WSJ texts, from which we remove the sentences of the PTB and those whose length is < 4 or > 40. We use the MST dependency parser (McDonald and Pereira, 2006) for English and build a projective, second-order model, trained on sections 2-21 of the Penn Treebank WSJ (PTB). Prior to that, the PTB was patched with NP bracketing rules (Vadas and Curran, 2007) and converted to dependencies with LTH (Johansson and Nugues, 2007). The parser achieves an unlabeled/labeled accuracy of 91.5/85.22 on section 23 of the PTB without retagging the POS. For POS-tagging the Bllip corpus and the evaluation datasets, we use the Citar tagger.^6 After parsing, we replace the words occurring fewer than 40 times with a special symbol to model OOV words. This results in a vocabulary size of 27,000 words.

^5 In a bottom-up pass, a belief vector represents the local evidence by multiplying the messages received from the children of a node, as well as the emission probability at that node.

Dutch. We first produce a random sample of 2.5M sentences from the SoNaR corpus,^7 then follow the same preprocessing steps as for English. We parse the corpus with Alpino (van Noord, 2006), an HPSG parser with a maxent disambiguation component. In contrast with English, for which we use word forms, we keep here the root forms produced by the parser’s lexical analyzer. The resulting vocabulary size is 25,000 words. The analyses produced by the parser represent multiple parents to facilitate the treatment of wh-clauses, coordination and passivization. Since our method expects proper trees, we convert the parser output to CoNLL format.^8

5.3 Evaluation tasks 

Named entity recognition. We use the standard 
CoNLL-2002 shared task dataset for Dutch and 
CoNLL-2003 dataset for English. We also include 
the out-of-domain MUC-7 testset, preprocessed 
according to Turian et al. (2010). We refer the 
reader to Ratinov and Roth (2009) for a detailed 
description of the NER classification problem. 

Just like Turian et al. (2010), we use the averaged structured perceptron (Collins, 2002) with Viterbi as the base for our NER system.^9 The classifier is trained for a fixed number of iterations, and uses these baseline features:

• $w_k$ information: is-alphanumeric, all-digits, all-capitalized, is-capitalized, is-hyphenated;

• prefixes and suffixes of $w_k$;

• word window: $w_k$, $w_{k \pm 1}$, $w_{k \pm 2}$;

• capitalization pattern in the word window.
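
The baseline feature templates listed above can be sketched as a small feature extractor. Several details are assumptions: the maximum affix length of 3, the padding symbol, the feature-name strings and the encoding of the capitalization pattern as one Boolean per window position are all hypothetical choices, not taken from the text.

```python
def word_shape_features(word):
    """Orthographic properties of a single token."""
    return {
        "is_alphanumeric": word.isalnum(),
        "all_digits": word.isdigit(),
        "all_capitalized": word.isupper(),
        "is_capitalized": word[:1].isupper(),
        "is_hyphenated": "-" in word,
    }


def baseline_features(tokens, k, max_affix=3):
    """Baseline NER features for the token at position k."""
    word = tokens[k]
    feats = {f"shape:{n}": v for n, v in word_shape_features(word).items()}
    # Prefixes and suffixes up to max_affix characters (assumed length).
    for n in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix:{word[:n]}"] = True
        feats[f"suffix:{word[-n:]}"] = True
    # Word identities and capitalization in a window of two tokens
    # to the left and to the right.
    for offset in (-2, -1, 0, 1, 2):
        j = k + offset
        neighbour = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats[f"word[{offset}]:{neighbour}"] = True
        feats[f"cap[{offset}]"] = neighbour[:1].isupper()
    return feats
```

Word representation features would be added on top of this dictionary, either as N real-valued entries or as a single indicator entry.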

We construct $N$ real-valued features for a word vector of dimensionality $N$, and a simple indicator feature for a categorical word representation.

^6 http://github.com/danieldk/citar
^7 http://lands.let.ru.nl/projects/SoNaR
^8 http://www.let.rug.nl/bplank/alpino2conll/
^9 http://github.com/LxMLS/lxmls-toolkit



Semantic frame identification. Frame-semantic parsing is the task of identifying a) semantic frames of predicates in a sentence (given target words evoking frames), and b) frame arguments participating in these events (Fillmore, 1982; Das et al., 2014). Compared to NER, in which the classification decisions apply to a relatively small set of words, the problem of semantic frame identification concerns making predictions for a broader set of words (verbs, nouns, adjectives, sometimes prepositions).


Figure 5: A parse of “Fluoride inhibits the development of caries” with Hindering and Cause_to_make_progress frames with respective arguments.

We use the Semafor parser (Das et al., 2014) consisting of two log-linear components trained with gradient-based techniques. The parser is trained and tested on the FrameNet 1.5 full-text annotations. Our test set consists of the same 23 documents as in Hermann et al. (2014). We investigate the effect of word representation features on the frame identification component. We measure Semafor’s performance on gold-standard targets, and report the accuracy on exact matches, as well as on partial matches. The latter give partial credit to identified related frames. We use and modify the publicly available implementation at http://github.com/sammthomson/semafor.

Our baseline features for a target $w_k$ include:

• $w_k$ and $w_{\pi(k)}$ (if the parent is a preposition, the grandparent is taken by collapsing the dependency),

• their lemmas and POS tags,

• syntactic functions between:

- $w_k$ and its children,

- $w_k$ and $w_{\pi(k)}$ (added by ourselves),

- $w_{\pi(k)}$ and its parent $w_{\pi(\pi(k))}$.

5.4 Baseline word representations 

We test our model, which we call Synfunc-Hmm, against the following baselines:

• Baseline: no word representation features 

• Hmm: a sequential HMM 

• Tree-Hmm: a tree HMM 

We induce other representations for comparison: 


• Brown: Brown clusters 

• Dep-Brown: dependency Brown clusters 

• Skip-Gram: Skip-Gram word embeddings 

5.5 Preparing word representations 

Brown clusters. Brown clusters (Brown et al., 1992) are known to be effective and robust when compared, for example, to word embeddings (Bansal et al., 2014; Passos et al., 2014; Nepal and Yates, 2014; Qu et al., 2015). The method can be seen as a special case of an HMM in which word emissions are deterministic, i.e. a word belongs to at most one semantic class. Recently, an extension has been proposed on the basis of a dependency language model (Suster and van Noord, 2014). We use the publicly available implementations.^10

Following other work on English (Koo et al., 2008; Nepal and Yates, 2014), we add both coarse- and fine-grained clusters as features by using prefixes of length 4, 6, 10 and 20 in addition to the complete binary tree path. For Dutch, coarser-grained clusters do not yield any improvement. Brown features are included in a window around the target word, just as the NER word features. When adding cluster features to the frame-semantic parser, we transform the cluster identifiers to one-hot vectors, which gives a small improvement over the use of indicator features.
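
Extracting coarse- and fine-grained cluster features from a Brown bit path can be sketched as follows. The prefix lengths are those given in the text; the feature-name strings and the function name are assumptions.

```python
def brown_prefix_features(bit_path, lengths=(4, 6, 10, 20)):
    """Cluster features for one word from its Brown binary tree path.

    Returns the complete path plus one feature per prefix length.
    Paths shorter than a prefix length are used in full.
    """
    feats = [f"brown_full={bit_path}"]
    for n in lengths:
        feats.append(f"brown_p{n}={bit_path[:n]}")
    return feats
```

Short prefixes correspond to nodes high in the cluster hierarchy and therefore to coarser classes, while the full path identifies the leaf cluster.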

HMM-based models. The $N$-dimensional representations obtained from HMMs and their variants are included as $N$ distinct continuous features. In the NER task, word representations are included at $w_k$ and $w_{k+1}$ for Dutch and at $w_k$ for English, which we determined on the development set. We investigate state space sizes of 64, 128 and 256 and finally choose $N = 128$ as a reasonable trade-off between training time and quality. We use the same dimensionality for other word representation models in this paper.

We observe that by constraining Synfunc-Hmm to use only the $k$ most frequent syntactic functions and to treat the remaining ones as a single special syntactic function, we obtain better results in our evaluation tasks. This is because for a model with all $S$ syntactic functions produced by the parser, there is less learning evidence for the more infrequent syntactic functions. We explore the effect of keeping up to the five most frequent syntactic functions, ignoring functional ones^11 such as punctuation and determiner. The final selection is shown in table 2.

^10 http://github.com/percyliang/brown-cluster, http://github.com/rug-compling/dep-brown-cluster

               English                        Dutch                          MUC test set
Model          P      R      F-1              P      R      F-1              F-1            F-1_type       F-1_unlab
Baseline       80.12  77.30  78.69            75.36  70.92  73.07            65.44          87.04          96.25
Hmm            81.49  78.90  80.17            77.61  73.97  75.74            70.20          87.66          96.50
Tree-Hmm       80.49  78.10  79.28            77.41  73.48  75.40            65.67          86.99          96.53
Synfunc-Hmm    80.65  78.90  79.76 (+.48)     78.54  75.23  76.85 (+1.45)    66.49 (+.82)   86.93 (-.06)   96.69 (+.16)
Brown          80.15  77.28  78.70            77.88  71.73  74.68            68.85          87.72          96.67
Dep-Brown      78.80  75.73  77.23            77.50  73.66  75.53            68.31          87.44          96.47
Skip-Gram      80.80  78.98  79.88            76.02  71.28  73.57            72.42          88.94          96.69

Table 1: NER results (precision, recall and F-score) on English and Dutch test sets. Best result per column in bold. The score increase reported in parentheses is in comparison to Tree-Hmm. F-1_type is the F-score measured per word type, and F-1_unlab is the F-score measured per word type, ignoring labels.

English                          Dutch
nmod (nominal modifier)          mod (modifier)
pmod (prepositional modifier)    su (subject)
sub (subject)                    obj1 (direct object)
                                 cnj (conjunction)
                                 mwp (multiword unit)

Table 2: Syntactic functions in Synfunc-Hmm for English (produced by the MST parser) and Dutch (produced by Alpino).
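
Restricting the model to the most frequent syntactic functions amounts to a relabeling step before training, which can be sketched as below. The name of the special label (`OTHER`), the treatment of ignored functional labels (they are mapped to the special label as well) and the function name are assumptions for illustration.

```python
from collections import Counter


def restrict_functions(labels, k, ignore=frozenset(), other="OTHER"):
    """Keep the k most frequent syntactic functions, collapse the rest.

    labels -- syntactic function of every token in the training corpus
    ignore -- functional labels (e.g. punctuation) never kept on their own
    other  -- the single special label that replaces all remaining ones
    """
    counts = Counter(l for l in labels if l not in ignore)
    keep = {label for label, _ in counts.most_common(k)}
    return [l if l in keep else other for l in labels]
```

With fewer distinct labels, each slice of the transition and emission arrays is estimated from more tokens, which is the motivation given in the text.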

For NER experiments, the representations from all HMM models are obtained with three different “decoding” methods (§ 4.2). Since Post-Type is performing best overall, we only report the results for this method in the evaluation.^12

Word embeddings. We use the Skip-Gram model presented in Mikolov et al. (2013a) (https://code.google.com/p/word2vec/), trained with negative sampling (Mikolov et al., 2013b). The training seeks to maximize the dot product between word-context pairs encountered in the training corpus and minimize the dot product between those pairs in which the context word is randomly sampled. We set both the number of negative examples
and the size of the context window to 5, the down¬ 
sampling threshold to 1 x 10“^, and the number 
of iterations to 15. 

5.6 NER results 

The results for all test sets are shown in table 1. For English, all HMM-based models improve on the baseline, with the sequential Hmm achieving the highest F-score. Our Synfunc-Hmm performs on a par with Skip-Gram. It outperforms the unlabeled-tree model, indicating that the added observations are useful and correctly incorporated. Brown clusters do not exceed the Baseline score.^13 Testing for significance with a bootstrap method (Søgaard et al., 2014), we find that only Hmm improves significantly at p < 0.01 on macro-F1 over Baseline, while Skip-Gram and Synfunc-Hmm show significant improvements only for the location entity type.

^11 We define the list of function-marker syntactic functions following Goldberg and Orwant (2013).

^12 While exploring the constraint on the number of syntactic functions, we do find that Post-Token outperforms Post-Type in some sets of syntactic functions, but not in the final, best-performing selection.

The general trend for Dutch is somewhat different. Most notably, all word representations contribute much more effectively to the overall classification performance compared to English. The best-scoring model, our Synfunc-Hmm, improves over the baseline significantly by as much as about 3.8 points. Part of the reason Synfunc-Hmm works so well in this case is that it can make use of the informative “mwp” syntactic function between the parts of a multiword unit. As for English, the unlabeled-tree HMM performs slightly worse than the sequential Hmm. The cluster features are more valuable here than in English, and we also observe a 0.7-point advantage by using dependency Brown clusters over the standard, bigram Brown clusters. The Skip-Gram model does not perform as well as in English, which might indicate that the hyper-parameters would need fine-tuning specific to Dutch.

On the out-of-domain MUC dataset, tree-based representations appear to perform poorly, whereas the highest score is achieved by the Skip-Gram method. Unfortunately, it is difficult to generalize from these F-1 results alone. Concretely, the dataset contains 3,518 named entities, and the Skip-Gram method makes 258 more correct predictions than Tree-Hmm. However, because the MUC dataset covers the narrow topic of missile-launch scenarios, the system gets badly penalized if a mistake is made repeatedly for a certain named entity. For example, the entity “NASA” alone occurs 103 times, most of which are wrongly classified by the Tree-Hmm system, but correctly by Skip-Gram. The overall performance may therefore hinge on a limited number of frequently occurring entities. A workaround is to evaluate per entity type: calculate the F-score for each entity, then average over all entity types. The results for this evaluation scenario are reported as F-1_type. Skip-Gram still performs best, but the difference to other models is smaller. Finally, we also report F-1_unlab, calculated as F-1_type but ignoring the actual entity label. So, if a named-entity token is recognized as such, we count it as a correct prediction regardless of the entity label type, similarly to Ratinov and Roth (2009). Since Synfunc-Hmm performs better here, we can conclude that it is more effective at identifying entities than at labeling them.

^13 However, after additional experiments we observe that the cluster features do improve over the baseline score when the number of clusters is increased.

The fact that we observe different tendencies for English and Dutch can be attributed to an interplay of factors, such as language differences (Bender, 2011), differently-performing syntactic parsers, and differences specific to the evaluation datasets. We briefly discuss the first possibility. It is clear from table 1 that all syntax-based models (Dep-Brown, Tree-Hmm, Synfunc-Hmm) generally benefit Dutch more than English. We hypothesize that since the word order in Dutch is generally less fixed than in English,^14 a sequence-based model for Dutch cannot capture selectional preferences that successfully, i.e. there is more interchanging of semantically diverse words in a small word window. This then makes the difference in performance between sequential and tree models more apparent for Dutch.

5.7 Semantic frame identification results 

The results are shown in table 3. The best score is obtained by the Skip-Gram embeddings; however, the difference to other models outperforming the baseline is small. For example, Skip-Gram correctly identifies only two cases more than Dep-Brown, out of 3681 correctly disambiguated frames.

The Synfunc-Hmm model outperforms all other HMM models in this task. The differences are larger when scoring partial matches.

^14 For instance, it is unusual for the direct object in English to precede the verb, but quite common in Dutch.


Model          Exact            Partial
Baseline       82.70            90.44
Hmm            82.20            90.20
Tree-Hmm       82.89            90.59
Synfunc-Hmm    82.95 (+0.06)    90.80 (+0.21)
Brown          83.15            90.74
Dep-Brown      83.15            90.76
Skip-Gram      83.19            90.91

Table 3: Frame identification accuracy. Score increase in parentheses is relative to Tree-Hmm.

5.8 Further discussion 

We can conclude from the NER experiments that
unlabeled syntactic trees do not in general provide
a better structure for defining the contexts
than plain sequences. The only exception
is the case of dependency Brown clustering for
Dutch. Comparing our results to those of Grave
et al. (2013), we therefore cannot confirm the
same advantage when using unlabeled-tree
representations. In semantic frame identification,
however, the unlabeled-tree representations do compare
more favorably to sequential representations.

Our extension with syntactic functions outperforms
the baseline and other HMM-based representations
in practically all experiments. It also outperforms
all other word representations in Dutch
NER. The advantage comes from discriminating
between the types of contexts, for example between
a modifier and a subject, which is impossible in
sequential or unlabeled-tree HMM architectures. The
results for English are comparable to those of the
state-of-the-art representation methods.
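
To make the distinction between context types concrete, the following Python
sketch scores one tiny dependency tree under an unlabeled-tree
parameterization and under a function-aware one. It is only an illustration:
the probability values, the state names "A" and "B", the function labels and
the choice to let the label of the edge to the head select both the
transition and the emission table are assumptions for this example, not the
authors' implementation.

```python
import math

# Toy parameters, made up for illustration; only the entries needed below
# are listed. The key None holds the tables shared by all edges (as in an
# unlabeled-tree model); the other keys hold one table per function.
TRANS = {
    None:   {("ROOT", "A"): 0.5, ("A", "B"): 0.5},
    "root": {("ROOT", "A"): 0.5},
    "subj": {("A", "B"): 0.8},
    "mod":  {("A", "B"): 0.2},
}
EMIT = {
    None:   {("A", "bark"): 0.1, ("B", "dogs"): 0.1},
    "root": {("A", "bark"): 0.1},
    "subj": {("B", "dogs"): 0.3},
    "mod":  {("B", "dogs"): 0.05},
}


def tree_log_prob(nodes, states, use_functions):
    """Joint log-probability of words and hidden states for one tree.

    nodes: list of (word, head_index, function); head_index -1 is the root.
    states: hidden state assigned to each node, in the same order.
    """
    total = 0.0
    for (word, head, func), state in zip(nodes, states):
        parent_state = "ROOT" if head < 0 else states[head]
        # The function label picks the table only in the function-aware model.
        key = func if use_functions else None
        total += math.log(TRANS[key][(parent_state, state)])
        total += math.log(EMIT[key][(state, word)])
    return total


states = ["A", "B"]
as_subject = [("bark", -1, "root"), ("dogs", 0, "subj")]
as_modifier = [("bark", -1, "root"), ("dogs", 0, "mod")]

# The unlabeled-tree parameterization cannot tell the two analyses apart ...
print(tree_log_prob(as_subject, states, False)
      == tree_log_prob(as_modifier, states, False))
# ... while the function-aware one assigns them different scores.
print(tree_log_prob(as_subject, states, True)
      > tree_log_prob(as_modifier, states, True))
```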

Footnote (significance of the frame identification
results): On exact matches, only Dep-Brown and Brown
significantly outperform the baseline at p < 0.05. On
partial matches, Dep-Brown, Brown, Skip-Gram and
Synfunc-Hmm all outperform the baseline significantly.
Synfunc-Hmm performs significantly better than Tree-Hmm
on partial matches, whereas the difference between
Skip-Gram and Synfunc-Hmm is not significant. The
significance tests were run using paired permutation.

6 Related work

HMMs have previously been used successfully for
learning word representations; see Huang et al. (2014)
for an overview, with an emphasis on investigating
domain adaptability. Models with more complex
architecture have been proposed, such as a
factorial HMM (Nepal and Yates, 2014), trained using
approximate variational inference and applied
to POS tagging and chunking. Recently, semantic
compositionality of HMM-based representations
based on the framework of distributional semantics
has been investigated by Grave et al. (2014).

There is a long tradition of unsupervised training
of HMMs for POS tagging (Kupiec, 1992;
Merialdo, 1994), with more recent work on
incorporating bias by favoring sparse posterior
distributions within the posterior regularization
framework (Graça et al., 2007), and for example on
auto-supervised refinement of HMMs (Garrette and
Baldridge, 2012). It would be interesting to see
how well these techniques could be applied to word
representation learning methods like ours.

The extension of HMMs to dependency trees for
the purpose of word representation learning was
first proposed by Grave et al. (2013). Although our
baseline HMM methods, Hmm and Tree-Hmm,
conceptually follow the models of Grave et al.,
there are still several practical differences. One
source of differences is in the precise steps taken
when performing Brown initialization, state
splitting, and also approximation of belief vectors
during inference. Another source involves the
evaluation setting. Their NER classifier uses only a single
feature, and the inclusion of Brown clusters does
not make use of the clustering hierarchy. In this
respect, our experimental setting is more similar to
Turian et al. (2010). Another practical difference is
that Grave et al. concatenate words with POS-tags
to construct the input text, whereas we use tokens
(English) or word roots (Dutch).
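
Making use of the clustering hierarchy is commonly done by turning prefixes
of a word's Brown cluster bit string into separate features, so that shorter
prefixes stand for coarser clusters. The Python sketch below illustrates the
idea. The bit string, the feature naming and the prefix lengths (4, 6, 10
and 20) are assumptions for this example and are not confirmed by the text
above.

```python
def brown_prefix_features(bitstring, prefix_lengths=(4, 6, 10, 20)):
    """Return one feature per prefix length of a Brown cluster bit string.

    A bit string encodes the path from the root of the cluster hierarchy
    to the word's cluster, so a short prefix names a coarse cluster and
    the full string names the finest one.
    """
    features = []
    for n in prefix_lengths:
        # Slicing past the end simply yields the full bit string.
        features.append(f"brown_p{n}={bitstring[:n]}")
    return features


# Hypothetical bit string for some word.
print(brown_prefix_features("1101001"))
```

A flat use of the clusters, by contrast, would correspond to a single
feature holding the full bit string.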

The incorporation of word representations into
semantic frame identification has been explored by
Hermann et al. (2014). They perform a projection
of generic word embeddings for context words to a
low-dimensional representation, which also learns
an embedding for each frame label. The method
selects the frame closest to the low-dimensional
representation obtained through mapping of the
input embeddings. Their approach differs from
ours in that they induce new representations that
are tied to a specific application, whereas we aim
to obtain linguistically enhanced word
representations that can be subsequently used in a variety
of tasks. In our case, the word representations
are thus included as additional features in the
log-linear model. The inclusion accounts for syntactic
functions between the target and its context words.
Although Hermann et al. also use syntactic
functions, they use them to position the general word
embeddings within a single input context embedding.
Unfortunately, we are unable to directly compare
our results with theirs as their parser
implementation is proprietary. The accuracy of our baseline
system on the test set is 0.27 percent lower in the
exact matching regime and 0.07 lower in the partial
matching regime compared to the baseline
implementation (Das et al., 2014) they used.

The topic of context type (syntactic vs. linear)
has been abundantly treated in distributional
semantics (Lin, 1998; Baroni and Lenci, 2010; van de
Cruys, 2010) and elsewhere (Boyd-Graber and Blei,
2008; Tjong Kim Sang and Hofmann, 2009).

7 Conclusion 

We have proposed an extension of a tree HMM with
syntactic functions. The obtained word
representations achieve better performance than those from
the unlabeled-tree model. Our results also show
that simply preferring an unlabeled-tree model
over a sequential model does not always lead to
an improvement. An important direction for
future work is to investigate how discriminating
between context types can lead to more accurate
models in other frameworks. The code for obtaining
HMM-based representations described in this
paper is freely available at
http://github.com/rug-compling/hmm-reps.